Groupby operations

Some imports:


In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

pd.options.display.max_rows = 10

Recap: the groupby operation (split-apply-combine)

The "group by" concept: we want to apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Similar to SQL GROUP BY

The example of the image in pandas syntax:


In [ ]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:

df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...

But pandas provides the groupby method to do this:


In [ ]:
df.groupby('key').aggregate('sum')  # np.sum

In [ ]:
df.groupby('key').sum()

Pandas does not only let you group by a column name. In df.groupby(grouper) can be many things:

  • Series (or string indicating a column in df)
  • function (to be applied on the index)
  • dict : groups by values
  • levels=[], names of levels in a MultiIndex

In [ ]:
df.groupby(lambda x: x % 2).mean()

And now applying this on some real data

These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.

cast dataset: different roles played by actors/actresses in films

  • title: title of the film
  • name: name of the actor/actress
  • type: actor/actress
  • n: the order of the role (n=1: leading role)

In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()

In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()
EXERCISE: Using groupby(), plot the number of films that have been released each decade in the history of cinema.

In [ ]:
# %load snippets/04b - Advanced groupby operations8.py
EXERCISE: Use groupby() to plot the number of "Hamlet" films made each decade.

In [ ]:
# %load snippets/04b - Advanced groupby operations9.py
EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?

In [ ]:
# %load snippets/04b - Advanced groupby operations10.py
EXERCISE: List the 10 actors/actresses that have the most leading roles (n=1) since the 1990's.

In [ ]:
# %load snippets/04b - Advanced groupby operations11.py
EXERCISE: Use groupby() to determine how many roles are listed for each of The Pink Panther movies.

In [ ]:
# %load snippets/04b - Advanced groupby operations12.py
EXERCISE: List, in order by year, each of the films in which Frank Oz has played more than 1 role.

In [ ]:
# %load snippets/04b - Advanced groupby operations13.py
EXERCISE: List each of the characters that Frank Oz has portrayed at least twice.

In [ ]:
# %load snippets/04b - Advanced groupby operations15.py

Transforms

Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with transform:


In [ ]:
df

In [ ]:
df.groupby('key').transform('mean')

In [ ]:
def normalize(group):
    return (group - group.mean()) / group.std()

In [ ]:
df.groupby('key').transform(normalize)

In [ ]:
df.groupby('key').transform('sum')
EXERCISE: Add a column to the `cast` dataframe that indicates the number of roles for the film.

In [ ]:
# %load snippets/04b - Advanced groupby operations21.py
EXERCISE: Calculate the ratio of leading actor and actress roles to the total number of leading roles per decade.

Tip: you can to do a groupby twice in two steps, once calculating the numbers, and then the ratios.


In [ ]:
# %load snippets/04b - Advanced groupby operations22.py

In [ ]:
# %load snippets/04b - Advanced groupby operations23.py

In [ ]:
# %load snippets/04b - Advanced groupby operations24.py

Intermezzo: string manipulations

Python strings have a lot of useful methods available to manipulate or check the content of the string:


In [ ]:
s = 'Bradwurst'

In [ ]:
s.startswith('B')

In pandas, those methods (together with some additional methods) are also available for string Series through the .str accessor:


In [ ]:
s = pd.Series(['Bradwurst', 'Kartoffelsalat', 'Sauerkraut'])

In [ ]:
s.str.startswith('B')

For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

EXERCISE: We already plotted the number of 'Hamlet' films released each decade, but not all titles are exactly called 'Hamlet'. Give an overview of the titles that contain 'Hamlet', and that start with 'Hamlet':

In [ ]:
# %load snippets/04b - Advanced groupby operations29.py

In [ ]:
# %load snippets/04b - Advanced groupby operations30.py
EXERCISE: List the 10 movie titles with the longest name.

In [ ]:
# %load snippets/04b - Advanced groupby operations31.py

In [ ]:
# %load snippets/04b - Advanced groupby operations32.py

Value counts

A useful shortcut to calculate the number of occurences of certain values is value_counts (this is somewhat equivalent to df.groupby(key).size()))

For example, what are the most occuring movie titles?


In [ ]:
titles.title.value_counts().head()
EXERCISE: Which years saw the most films released?

In [ ]:
# %load snippets/04b - Advanced groupby operations34.py
EXERCISE: Plot the number of released films over time

In [ ]:
# %load snippets/04b - Advanced groupby operations35.py
EXERCISE: Plot the number of "Hamlet" films made each decade.

In [ ]:
# %load snippets/04b - Advanced groupby operations36.py
EXERCISE: What are the 11 most common character names in movie history?

In [ ]:
# %load snippets/04b - Advanced groupby operations37.py
EXERCISE: Which actors or actresses appeared in the most movies in the year 2010?

In [ ]:
# %load snippets/04b - Advanced groupby operations38.py
EXERCISE: Plot how many roles Brad Pitt has played in each year of his career.

In [ ]:
# %load snippets/04b - Advanced groupby operations39.py
EXERCISE: What are the 10 most film titles roles that start with the word "The Life"?

In [ ]:
# %load snippets/04b - Advanced groupby operations40.py
EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in 2000s?

In [ ]:
# %load snippets/04b - Advanced groupby operations41.py

In [ ]:
# %load snippets/04b - Advanced groupby operations42.py

In [ ]: